Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

Dallin Moak

Source code

source code available at p3_source.py

Show the code
import os
from p3_source import df_raw

_ = os.getcwd()
df_raw
airport_code airport_name month year num_of_flights_total num_of_delays_carrier num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security num_of_delays_weather num_of_delays_total minutes_delayed_carrier minutes_delayed_late_aircraft minutes_delayed_nas minutes_delayed_security minutes_delayed_weather minutes_delayed_total
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January 2005.0 35048 1500+ -999 4598 10 448 8355 116423.0 104415 207467.0 297 36931 465533
1 DEN Denver, CO: Denver International January 2005.0 12687 1041 928 935 11 233 3153 53537.0 70301 36817.0 363 21779 182797
2 IAD January 2005.0 12381 414 1058 895 4 61 2430 NaN 70919 35660.0 208 4497 134881
3 ORD Chicago, IL: Chicago O'Hare International January 2005.0 28194 1197 2255 5415 5 306 9178 88691.0 160811 364382.0 151 24859 638894
4 SAN San Diego, CA: San Diego International January 2005.0 7283 572 680 638 7 56 1952 27436.0 38445 21127.0 218 4326 91552
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
919 IAD Washington, DC: Washington Dulles International December 2015.0 2799 182 183 61 0 17 443 NaN 15438 2826.0 0 1825 31164
920 ORD Chicago, IL: Chicago O'Hare International December 2015.0 25568 923 1755 1364 11 180 4233 80962.0 132055 72045.0 435 22459 307956
921 SAN San Diego, CA: San Diego International n/a 2015.0 6231 480 606 256 5 37 1383 25402.0 35796 9038.0 161 2742 73139
922 SFO San Francisco, CA: San Francisco International December 2015.0 13833 757 1180 2372 9 147 4465 55283.0 96703 193525.0 285 13788 359584
923 SLC Salt Lake City, UT: Salt Lake City International December 2015.0 8804 483 796 404 5 56 1745 37354.0 49549 13515.0 158 6693 107269

924 rows × 17 columns

Elevator pitch

The dataset from from BTS represents data about flight delays over a decade from seven airports. The data indicates that the san Francisco airport has the most delays of any length, but that fact obscures the fact that there is a large variability in the length of the delays. If an airport has 100% of its flights delayed for just 1 minute, that wouldn’t really be a big deal. The most important metric probably represent delayed flights measured as a count against on-time flights, but moderated by the amount of time the thing was delayed. Further research is indicated. This anylist will request a grant for more study of this dataset.

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.

So I went through the dataset and made sure every instance (row) had all properties (column) and that if the value isn’t present it was set to pandas.NA. After that, I checked to see if any of the properties that ought to be ints are failing integer casting, and attempted to fix them by casting that value as a string, removing all non-numeric characters from the value and casting it back to an int. I then handle all instances of empty strings by setting them to pandas.NA and then cast every column to its proper type. I also removed all instances(rows) that didn’t have a clear month, assuming that rows that don’t give a discernable month are useless and a potential skew of the data. Then i fixed a misspelling of february i found. If this was a real project, I’d create a supplamentary table to map airport codes to airport names, but that wasn’t part of the assignment

Show the code
from p3_source import q1_example, df
from pandas import NA, isna

q1_example.head().style.format(na_rep="NaN")
  airport_code airport_name month year num_of_flights_total num_of_delays_carrier num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security num_of_delays_weather num_of_delays_total minutes_delayed_carrier minutes_delayed_late_aircraft minutes_delayed_nas minutes_delayed_security minutes_delayed_weather minutes_delayed_total
2 IAD NaN January 2005.000000 12381 414 1058 895 4 61 2430 NaN 70919 35660.000000 208 4497 134881
9 IAD Washington, DC: Washington Dulles International February 2005.000000 10042 284 631 691 4 28 1639 15573.000000 39840 NaN 169 1359 78878
12 SFO San Francisco, CA: San Francisco International February 2005.000000 9327 599 457 1010 0 57 2122 30760.000000 27302 NaN 6 3178 110995
13 SLC NaN February 2005.000000 12404 645 463 752 10 79 1947 32336.000000 23087 24544.000000 293 4614 84874
14 ATL Atlanta, GA: Hartsfield-Jackson Atlanta International March 2005.000000 37806 1462 999 5697 11 423 9431 NaN 150766 396191.000000 568 33379 691887

QUESTION|TASK 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

pandas dataframe styler is giving me trouble, but here’s the data that indicates that SFO airport leads in both average time of delay per flight through all months and percent of flights that have a delay

Show the code
# Include and execute your code here
from p3_source import q2_data


def hours_format(x):
    if pd.isna(x):
        return ""
    h = int(x)
    m = int(round((x - h) * 60))
    return f"{h}h {m}m" if m else f"{h}h"


def percent_format(x):
    return f"{x*100:.1f}%" if pd.notna(x) else ""


q2_data.style.format(
    {"delay_percent": percent_format, "avg_delay_per_flight_hours": hours_format}
)

q2_data
avg_delay_per_flight total_flights total_delays delay_percent avg_delay_per_flight_hours
airport_code
ATL 11.971153 4295863 868546 0.202182 0.199519
DEN 9.939602 2455899 455603 0.185514 0.165660
IAD 11.985960 831904 164936 0.198263 0.199766
ORD 15.514155 3516798 815073 0.231766 0.258569
SAN 8.978680 877340 167483 0.190899 0.149645
SFO 16.175981 1591177 415090 0.260870 0.269600
SLC 7.176067 1347833 196873 0.146066 0.119601

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

I’ve already excluded rows that didn’t provide month data, so all records represent an individual month/airport combo. Since the question specified delays of any length, delay time will not be considered. based on sums of delayed flights from all airports for each month combined with sums of total flights, it seems clear that september at 16.5% missed flights and december at 16.7% missed flights.

Show the code
from p3_source import q3_data, q3_chart

q3_data
total_flights total_delays delay_percent month_no month
4 1193018 265001 0.222127 1 January
3 1115814 248033 0.222289 2 February
7 1213370 250142 0.206155 3 March
0 1259723 231408 0.183698 4 April
8 1227795 233494 0.190173 5 May
6 1305663 317895 0.243474 6 June
5 1371741 319960 0.233251 7 July
1 1335158 279699 0.209488 8 August
11 1227208 201905 0.164524 9 September
10 1301612 235166 0.180673 10 October
9 1185434 197768 0.166832 11 November
2 1180278 303133 0.256832 12 December
Show the code
q3_chart

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

a. 100% of delayed flights in the Weather category are due to weather  
b. 30% of all delayed flights in the Late-Arriving category are due to weather  
c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%    

I created a map (average_weather_rates) based on the rules above that gives the average percent of late_aircraft or nas delays by month, and then created a dataframe that puts the late_aircraft multipled by the percent for the row’s month in a late aircraft weather column, and the same for nas delays. I also added columns that sums late aircraft weather delays and nas weather delays, and one that sums that sum and the actual weather delays to get total mild and severe weather delays.

The values for the mild weather delays are by nesessity not integers, because they’re based on extrapolation from the monthly weather delay proportions from the rules.

Assuming the rates in the rules come from averages based on data that is representative of the flight data in the dataset, then my data faithfully extrapolates (approximate) values based on those average rates. I’m leaving them as floating-point values instead of rounding to integers so as to not obfuscate information.

Show the code
from p3_source import q4_data

q4_data
airport_code month year total_flights total_delays late_delays late_weather_ext nas_delays nas_weather_ext weather_mild_total weather_severe weather_all
0 ATL January 2005.0 35048 8355 999 299.7 4598 1839.2 2138.9 448 2586.9
1 DEN January 2005.0 12687 3153 928 278.4 935 374.0 652.4 233 885.4
2 IAD January 2005.0 12381 2430 1058 317.4 895 358.0 675.4 61 736.4
3 ORD January 2005.0 28194 9178 2255 676.5 5415 2166.0 2842.5 306 3148.5
4 SAN January 2005.0 7283 1952 680 204.0 638 255.2 459.2 56 515.2
... ... ... ... ... ... ... ... ... ... ... ... ...
892 DEN December 2015.0 18001 3610 1463 438.9 981 392.4 831.3 152 983.3
893 IAD December 2015.0 2799 443 183 54.9 61 24.4 79.3 17 96.3
894 ORD December 2015.0 25568 4233 1755 526.5 1364 545.6 1072.1 180 1252.1
895 SFO December 2015.0 13833 4465 1180 354.0 2372 948.8 1302.8 147 1449.8
896 SLC December 2015.0 8804 1745 796 238.8 404 161.6 400.4 56 456.4

897 rows × 12 columns

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

I agregated the weather breakdown data from q4 by airport, and then melted it so that there was a row for each airport/weather category combo (ATL Mild, ALT Severe, DEN Mild, DEN Severe, ect). then I stacked ’em so you could see them as sums. this view allowed me to see both the relative rates for each category of weather delay, but also the proportion of mild to severe weather delays. it looks like there were about 1 severe weather delays for every 10 mild weather delays. The two highest weather delays, san francisco and chicago make sense to me; the fog in SF is famous and chicago’s got all that cold wind off lake michigan

Show the code
from p3_source import q5_data_melted, q5_chart

q5_data_melted
airport_code delay_type delay_rate
0 ATL weather_mild_delay_rate 0.060603
7 ATL weather_severe_delay_rate 0.007338
1 DEN weather_mild_delay_rate 0.051904
8 DEN weather_severe_delay_rate 0.005513
2 IAD weather_mild_delay_rate 0.054624
9 IAD weather_severe_delay_rate 0.005645
3 ORD weather_mild_delay_rate 0.075767
10 ORD weather_severe_delay_rate 0.005789
4 SAN weather_mild_delay_rate 0.046532
11 SAN weather_severe_delay_rate 0.004734
5 SFO weather_mild_delay_rate 0.085776
12 SFO weather_severe_delay_rate 0.006307
6 SLC weather_mild_delay_rate 0.035481
13 SLC weather_severe_delay_rate 0.004959
Show the code
q5_chart

STRETCH QUESTION|TASK 1

Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.

type your results and analysis here

Show the code
# Include and execute your code here